
2 research outputs found

A design method for supporting the development and integration of ARTful global schedulers into multiple programming models

Author: Santana Alexandre de Limas
Publication date: 01/01/2019

Master's dissertation, Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2019.

Abstract: Execution platforms for high performance computing are becoming more diverse as new architectures and tools emerge to exploit the inherent parallelism of applications. These new options open performance-enhancing opportunities for scientific and engineering applications. The growing diversity of platforms complicates the mapping of an application's tasks onto the environment, a step that runtime systems and their global schedulers must handle implicitly so that application performance and portability are preserved. However, the development and integration of novel global schedulers into industry-standard systems such as OpenMP and MPI lack framework support. This work proposes a library, MOGSLib, to support the development and integration of global schedulers into different high performance runtime systems. MOGSLib employs reusable, independent and testable abstractions, represented as C++ template structures, to express scheduling policies, their relationship to runtime systems and the scheduling solution taxonomy. This approach allows a bottom-up development process based on the incremental composition of abstractions through template specializations. We evaluate the overhead of our portable policies against their system-native counterparts in two environments, the Charm++ and OpenMP systems. In our experiments, user-defined scheduling policies could be deployed both as loop schedulers in LibGOMP and as load balancers in Charm++ by selecting the implementations of the abstractions that compose them. For centralized, workload-aware policies applied to molecular dynamics kernels, the MOGSLib schedulers yielded application makespans equivalent to those of the native versions. Ultimately, the flexibility to incorporate user-defined structures and scheduling policies into runtime systems, with limited changes to the runtime system code bases, hints that flexible global schedulers can be built with negligible overhead even for high performance environments.
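To make the idea of a scheduling policy written once against a C++ template abstraction more concrete, here is a minimal sketch. It does not reproduce MOGSLib's actual interfaces: the names `VectorRuntimeAdapter`, `task_loads`, `pe_count` and `greedy_schedule` are hypothetical, and the greedy heuristic is only an assumed example of a centralized, workload-aware policy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical adapter (not a MOGSLib type): exposes the data a policy needs.
// A real runtime system would supply its own adapter with the same two members.
struct VectorRuntimeAdapter {
    std::vector<double> loads;  // estimated cost of each task
    std::size_t num_pes;        // number of processing elements (PEs)

    const std::vector<double>& task_loads() const { return loads; }
    std::size_t pe_count() const { return num_pes; }
};

// Assumed example policy: visit tasks from heaviest to lightest and assign
// each one to the currently least-loaded PE. The policy depends only on the
// adapter's interface, so the same code can be instantiated for any runtime
// that provides task_loads() and pe_count().
template <typename Adapter>
std::vector<std::size_t> greedy_schedule(const Adapter& rt) {
    const std::vector<double>& loads = rt.task_loads();
    const std::size_t pes = rt.pe_count();
    assert(pes > 0 && "at least one processing element is required");

    // Task indices sorted by decreasing load (stable for equal loads).
    std::vector<std::size_t> order(loads.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&loads](std::size_t a, std::size_t b) {
                         return loads[a] > loads[b];
                     });

    std::vector<double> pe_load(pes, 0.0);
    std::vector<std::size_t> mapping(loads.size(), 0);  // task -> PE
    for (std::size_t task : order) {
        // min_element returns the first PE with the smallest accumulated load.
        const auto least = std::min_element(pe_load.begin(), pe_load.end());
        const std::size_t pe =
            static_cast<std::size_t>(least - pe_load.begin());
        mapping[task] = pe;
        pe_load[pe] += loads[task];
    }
    return mapping;
}
```

Calling `greedy_schedule(VectorRuntimeAdapter{{5.0, 3.0, 2.0, 4.0}, 2})` maps the four tasks onto two PEs with balanced total loads. Supporting another runtime would, under this sketch's assumptions, only require a different adapter type.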

Repositório Institucional da UFSC

Efficient direct convolution using long SIMD instructions

Authors: Armejach Sanosa Adrià; Casas Guix Marc; Limas Santana Alexandre de
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2023

This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss how to adapt the state-of-the-art SIMD direct convolution to architectures using long SIMD instructions and analyze the implications of increasing the SIMD length on the algorithm formulation. Next, we propose two new algorithmic approaches: the Bounded Direct Convolution (BDC), which adapts the amount of computation exposed to mitigate cache misses, and the Multi-Block Direct Convolution (MBDC), which redefines the activation memory layout to improve the memory access pattern. We evaluate BDC, MBDC, the state-of-the-art technique, and a proprietary library on an architecture featuring CPUs with 16,384-bit SIMD registers using ResNet convolutions. Our results show that BDC and MBDC achieve respective speed-ups of 1.44× and 1.28× compared to the state-of-the-art technique for ResNet-101, and 1.83× and 1.63× compared to the proprietary library.

This work receives EuroHPC-JU funding under grant no. 101034126, with support from the Horizon2020 program. Adrià Armejach is a Serra Hunter Fellow and has been partially supported by the Grant IJCI-2017-33945 funded by MCIN/AEI/10.13039/501100011033. Marc Casas has been partially supported by the Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF Investing in your future. This work is supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project and the Generalitat de Catalunya (contract 2017-SGR-1414).

Peer Reviewed. Postprint (author's final draft).
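For readers unfamiliar with the term used in the abstract above, a direct convolution computes each output element straight from the input and the filter, without first transforming the data. The sketch below is a deliberately simplified scalar version (single channel, stride 1, no padding). It is not BDC, MBDC, or the state-of-the-art SIMD formulation evaluated in the paper, and the function name `direct_conv2d` and its parameters are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar direct convolution of an h x w input with a kh x kw filter
// (computed as cross-correlation, as is usual in CNN layers).
// Row-major storage; stride 1; no padding; single input/output channel.
std::vector<float> direct_conv2d(const std::vector<float>& in,
                                 std::size_t h, std::size_t w,
                                 const std::vector<float>& filter,
                                 std::size_t kh, std::size_t kw) {
    assert(in.size() == h * w);
    assert(filter.size() == kh * kw);
    assert(kh >= 1 && kw >= 1 && kh <= h && kw <= w);

    const std::size_t oh = h - kh + 1;  // output height
    const std::size_t ow = w - kw + 1;  // output width
    std::vector<float> out(oh * ow, 0.0f);

    for (std::size_t y = 0; y < oh; ++y) {
        for (std::size_t x = 0; x < ow; ++x) {
            float acc = 0.0f;
            // Accumulate the filter window anchored at (y, x).
            for (std::size_t ky = 0; ky < kh; ++ky) {
                for (std::size_t kx = 0; kx < kw; ++kx) {
                    acc += in[(y + ky) * w + (x + kx)] * filter[ky * kw + kx];
                }
            }
            out[y * ow + x] = acc;
        }
    }
    return out;
}
```

Optimized implementations restructure this loop nest (blocking, vectorization, data layout changes) to make better use of SIMD units and caches, which is the design space the abstract refers to.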

UPCommons. Portal del coneixement obert de la UPC